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Abstract 

We investigate multi-task learning from an output space regularization perspective. Most 
multi-task approaches tie together related tasks by constraining them to share input spaces 
and function classes. In contrast to this, we propose a multi-task paradigm which we call 
output space regularization, in which the only constraint is that the output spaces of the 
multiple tasks are related. We focus on a specific instance of output space regulariza- 
tion, multi-task averaging, that is both widely applicable and amenable to analysis. The 
multi-task averaging estimator improves on the single-task sample average under certain 
conditions, which we detail. Our analysis shows that for a simple case the optimal simi- 
larity depends on the ratio of the task variance to the task differences, but that for more 
complicated cases the optimal similarity behaves nonlinearly. Further, we show that the 
estimates produced are a convex combination of the tasks' sample averages. We discuss 
the Bayesian viewpoint. Three applications of multi-task output space regularization are 
presented: multi-task kernel density estimation, multi-task-regularized empirical moment 
constraints in similarity discriminant analysis, and multi-task local linear regression. Ex- 
periments on real data sets show statistically significant gains. 

Keywords: multi-task learning, similarity-based regularization, output space regulariza- 
tion, manifold regularization, kernel density estimation 



1. Introduction 

The motivating hypothesis of multi-task learning (MTL) is that combining data from related 
estimation tasks can yield superior estimates over learning each task separately. A central 
question of MTL is: how should the tasks be combined or tied together? Consider the 
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following visualization of a T-task MTL estimation problem: 

X\ > y\ 



v ft(xu;Pt) ,, 
<*t r yt 

v fT{x Ti ;^ T ) 

A-t > yr, 

where X% is the ith tasks' input space, 3^ is the ith tasks' output space, and ft(%u,Pt) is 
the £th tasks' function that maps input xu E Xt to output yu £ yt and parameterized by 
set Pf From the above figure, one observes that the tasks can be tied together through 

(a) the input spaces {X±, . . . , Xt}; 

(b) the parameter or function spaces, and/or, 

(c) the output spaces {3^, ■ • • ,^t}- 

Prior MTL approaches have focused on (a) and (b), assuming that tasks are related 
in that they all share an input space (i.e. X± = X for all t) and/or function class (i.e 
ft(x; (3 t ) = f(x;0i) for all t). Some examples: 

• Joint dimensionality reduction/selection with respect to all tasks at once (input space) . 



E.g. Obozinski et al. (2010) 



Jointly regularizing the function parameters j3t (parameter space). E.g. Argyriou 



et al. (2007) 



Placing joint priors on the (it for all t (parameter space). E.g. Bakker and Heskes 



(2003) 



Defining multi-task kernels (implicit input space). E.g. Micchelli and Pontil (2004). 



The possibility of (c) has been largely neglected; however, in this work we argue that it 
is a natural way to think about the relatedness of tasks. For example, two tasks that we 
believe should benefit from multi-task learning are "How long do trees live in the vicinity of 
a nuclear meltdown?" and "How long do humans live in the vicinity of a nuclear meltdown?" 
These two tasks share an output space (and are likely correlated), but the tasks' domains 
(trees and humans) may be described by unrelated features. In this paper we investigate 
cross-task regularization of the task-specific functions' outputs, a paradigm of multi-task 
learning that we call output space regularization (OSR). We show that OSR is both powerful 
and widely applicable. OSR enables MTL for tasks that do not share an input space or a 
function class; the tasks need only be related in that they aim to estimate related outputs. 
The task-specific input spaces and task-specific functions can be completely different. 

To elucidate the difference between standard MTL parametric regularization approaches 
and OSR, consider the following two example MTL objectives. Let (xu,yti) be input-output 
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training pairs for the ith task, with iVj pairs per task. Let A S M be the task similarity 
matrix, where A rs is the similarity between tasks r and s. The following MTL objective 
explicitly ties the tasks together in the parameter space: 

T N t T T 

argmin^^(y ij -/(x ti ;/3 < )) 2 + 7 ^^^ s ||/3 r -/3 s ||2. (1) 

{PtitLl t=l i=l r=l s=l 

This regularization results in parameter estimates that are more similar. Consider on the 
other hand the following OSR objective 

T Nt T T N r N B 

argmin^^(y ti - f t (x ti ; /3 4 )) 2 + 7^ ^ ^ ^ Kj;sk{fr(x rj ;f3 r ) - f s {x sk ; /3 S )) 2 , (2) 
{ftlLi t=l i=l r=l s=l j=l k=l 

where A r j. s fc is the task-and-sample similarity between x r j and x sk . The non-parametric 
form of (pj regularizes the outputs {f(xtf, fit)} of the different tasks, resulting in parameter 
estimates that correspond to output estimates that are more similar. 

In this paper, we will show that OSR is a flexible, widely applicable, and effective 
method with which to approach multi-task learning problems. In the next section we give 
the general form of OSR, after which we review related work in multi-task learning, as well 
as discuss how OSR is related to manifold regularization (Belkin et al. 2006) in Section |3j 



Through the lens of OSR, many estimation problems can be cast in a MTL framework, 
and potentially benefit from additional data sharing. In Section |4j we investigate the 
simplest application of OSR, which we termmulti-task averaging (MTA). The tractability 
of MTA enables an extensive analysis, given in Section [5j which provides some new intuition 
and insights about OSR in general. MTA is broadly applicable; we illustrate its use for multi- 
task kernel density estimation (MT-KDE) in Section pi and for specifying the empirical 
moment constraints in similarity discriminant analysis in Section [7j Then, in Section [8j we 
present a second formulation of OSR - multi-task linear regression. We interpret local linear 
regression as a multi-task estimation problem, and show that OSR local linear regression 
produces a smoother function over the domain of printer color management look-up-tables. 
We note that, with OSR, any local learning problem can be formulated as a MTL problem. 
The paper ends with conclusions and open questions. 

2. Output Space Regularization 

In multi-task learning, we are interested in learning T functions ft(xu] fit), where j3t is a set 
of parameters for the £th task, and xu is the ith input from task t. Suppose one is given 
N t training input-output pairs {{xu,yu)}i=i for the tth task for t = 1, . . . ,T. Note that x 
and y can be any objects, as long as the function outputs are comparable with y. Let L 
be a loss function, J be a regularization function, and 7 be a scalar regularization factor. 
Additive regularization in multi-task learning is predominantly parametric, that is, it can 
be formulated as 

T N t 
argminV V L(y ti , f t (x u ; &)) + jJ({/3 r }^ =1 ). (3) 

{PtYU t=i i=i 
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In this paper we focus on the alternative approach where the regularization is a function 
of the output values, such that the objective function can be written 



T N t 



argmin V V L(y ti , f t (x u ;(3 t )) + jJ({f r (x rj ; &)&), for r 6 {1, ... , T}. 



(4) 



{MLi t=i i=\ 



OSR is complementary to many parametric multi-task regularization strategies, and can 
be used in conjunction with them. Also, OSR is easy to generalize to transductive or semi- 
supervised learning by regularizing function outputs for unlabeled training data and/or test 
samples. In such cases, let \z r j\ r l\ be a set of samples that includes some combination or 
subset of the training samples, test samples (transductive), and unlabeled training samples 
(semi-supervised), where M r is the total number of samples from the rth tasks used in the 
regularizer. Then Q generalizes to 

T N t 

argmin^ £ L(y ti , f t (x ti ; &)) + jJ({fr(z r f, Pr)}f=i), for r € {1, ... , T}, 



(M T 



t=i t=i i=i 



All regularization functions J used in this paper require information about the related- 
ness of multiple tasks which we specify as a similarity matrix between tasks and sometimes 
samples. 



3. Related Work in Multi-Task Learning 



A related multi-task approach that focuses on function outputs is the work of Bonilla et al. 



(2008), where a Gaussian process prior is placed over the latent task functions to constrain 



their inner products. This method is limited only to Gaussian process regression. 

Next, we review other approaches to tying together tasks in various machine learning 
problems. 

One of the dominant approaches is the use of an explicit multi-task parameter regular- 
izer. Evgeniou and Pontil (2004), for instance, penalized the distance of model parameters 
Pt to 7j YL r Pr, the average of all tasks' model parameters. Argyriou et al. (2007) studied 



regularizers in which the spectral norm of the matrix j3 (the tth column is fit) is penalized. 



Kato et al. (2008) formulated a multi-task SVM by imposing a constraint on the difference 



between task parameters. 

Also common are Bayesian frameworks in which shared statistical structures of the 
parameters are learned or imposed. In the work of Bakker and Heskes (2003) some of the 



model parameters are shared amongst tasks, while others are connected through a joint 
prior. Xue et al. (2007) drew the /3j from a Dirichlet process prior, and Liu et al. (2009) 



placed a joint distribution over the tasks' parameters in a semi-supervised framework. 
Another approach is the construction of MT kernels. 



Micchelli and Pontil (2004) ex- 



plored the linear combination of kernels for multi-task learning, and Evgeniou et al. (2005) 



showed that the problem of estimating T task functions with certain types of regularization 



can be cast as a single-task problem using multi-task kernels. Sheldon (2008) built on this 



work by proposing a graphical multi-task learning framework using graph Laplacians. 

Other work focuses on tying tasks together by jointly selecting features or learning sub- 



spaces. Argyriou et al. (2008) use a (2, l)-norm on the matrix j3 to encourage a small-number 
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of non-zero rows. Obozinski et al. (2010) build on this work to obtain a computationally 



efficient joint subspace selection. 

Another major question in multi-task learning is how to estimate the similarity (or task 
relatedness) between tasks and/or samples if it is not provided. The standard approach, 
taken by many of the above-cited papers, is to estimate the similarity matrix jointly with 



the task parameters (Zhang and Yeung, 2010 



Xue et al. , 2007). As a more detailed example, 



Argyriou et al. 



Zhang and Yeung 



2007, 



Bonilla et al. 2008 



(2010) assumed that there 



exists a covariance matrix for the task relatedness, and proposed a convex optimization 
approach to estimate the task covariance matrix and the task parameters in an alternating 
way. 

Our work does not focus on the question of similarity matrix estimation, and, for all of 
our experiments, the task similarity matrix was either provided by domain experts or spec- 
ified from domain knowledge. It is straightforward to modify any of the OSR formulations 
in this paper so that the similarity matrix between the tasks and/or samples is learned in 
conjunction with the task parameters. 



OSR is similar to manifold regularization (Belkin et al., 2006) in that the function 



outputs are tied together, instead of the function parameters. One can view manifold 
regularization (without the RKHS function regularizer) as a special 1-task case of OSR. For 
example, Belkin et al.'s Laplacian-regularized least squares objective for semi-supervised 
regression solves 



arg mm 



YliLiiVi 



ffa)) 2 + A||/|& +7E£f AUfixt) - f( Xj )f 



where / is the regression function to be estimated, % is a reproducing kernel Hilbert space 
(RKHS), N is the number of labeled training samples, M is the number of unlabeled training 
samples, Af- is the similarity (or weight in an adjacency graph) between feature samples 
Xi and Xj, and \\f\\u is the norm of the function / in the RKHS. In OSR, as opposed 
to manifold regularization, we are estimating a separate function for each of the T tasks. 
Given that OSR allows for T unrelated input spaces and functions, it is an open question 
whether the mathematics that describe the manifold in Belkin et al.'s work still hold. 

A different approach to using manifold regularization concepts for MTL was recently 



proposed by Agarwal et al. (2010) for parametric multi-task learning. They assume that the 



task parameters lie on a low-dimensional manifold, and alternate estimating the manifold 
with learning the tasks in an iterative way. Their approach is restricted to a RKHS and 
assumes a global manifold structure for all tasks, which share a common parametric model. 



4. MTA Formulation 

In this section, we apply OSR to the problem of estimating multiple means, yielding what we 
call multi-task averaging (MTA). At first glance, MTA may strike one as a trivial formulation 
of OSR. We have found, however, that it leads to fruitful analysis, and is surprisingly 
applicable. One useful application of MTA is the novel multi-task kernel density estimation, 
which we propose in Section [6J Also, we found that MTA alleviates the problem of high 
variance in similarity discriminant analysis, as we show in Section [7} 

With squared-loss, MTA has a closed-form solution, and we are able to state some 
analytic results in the next section that, while important in their own right, also provide 
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intuition for more complicated problems. We give conditions for which the MTA estimates 
will be better than the task sample averages. We explore the key question, "How should the 
task similarities be defined?" We show the intuitive result that for the simplest two-task case 
the optimal similarity between the two tasks depends on the ratio of the task variance to 
the task mean differences. However, we show that for more complicated cases, the optimal 
similarity behaves unintuitively. Our main theorem is that the MTA estimates are a convex 
combination of the individual tasks' samples averages. We discuss the Bayesian viewpoint 
on MTA in Section |5.6| Key notation is given in Table [T] 

4.1 MTA Objective 

Consider the T-task problem of estimating the means of T random variables. To obtain the 
MTA objective from OSR, we simply replace ft with yt in (pj We assume that for the tth 
random variable one is given Nt samples {yti\iA\i where each ya S R is an IID draw from 
the task-specific random variable Yt- In addition, we assume the T x T matrix A describes 
the relatedness or similarity of any pair of the T tasks, with Au = for all t. The proposed 
MTA estimates are 

T Nt T T 

{Vt}f=i = argmin^^(yti-i/i) 2 + 7^^^ rs (y r -i/ s ) 2 . (5) 

{yt}f=i t=l i=l r=l s=l 

The first term minimizes the empirical loss, and the second term jointly regularizes the esti- 
mates. The regularization parameter 7 trades-off the empirical risk term versus smoothing 
the task estimates together. Note that if 7 = 0, ^ produces the T sample averages. The 
MTA problem can be interpreted as learning a constant function from noisy samples where 
the "parameter" j3 and the function output are the same thing. In other words, for the 
MTA case, output space regularization and function space regularization coincide. 
A more general formulation of MTA is 

T N t 

{vt}t=i = argmin ^ ^ L(y u , yt) + 7 J {{yt}f=i) , 

{yt}f=i t=i i=i 

where L is some loss function and J is a regularization function. If L is chosen to be any 
Bregman loss, then setting 7 = will produce the T sample averages. For the analysis and 
experiments in this paper, we restrict our attention to the squared error formulation given 
in @. 

4.2 Closed-form Solution for the Scalar Case 

We provide a closed- form solution for (|5j). Let y £ M. and A £ Mr have components 

~ = 2^=1 v_u , 6 x 

A l{ - Atr + Art) (7) 

tr ^+ 7 E 8 ^(4 + 4)' u 
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Table 1: Key Notation 

yti ith sample from ith task 

y\ Multi-task averaging estimate for ith task 

y\ Multi-task regularized sample average for ith task 

y~t Sample average for ith task 

yt An estimate of the mean of ith task 

Y* Matrix with columns y^ 



The closed-form solution given below in QSJ) will hold for any A such that I — A is in- 
vertible. The following lemma shows that, for invertibility of I — A, it is sufficient (though 
not necessary) for A to have non- negative entries. 

Lemma: Suppose all the entries of matrix A are non-negative and for all tasks we have 
at least one sample, that is, iVj > 1 for all i G {1, ... ,T}. Then matrix I — A is invert- 

-A r iA r 2 



2(N 1 +N 2 )'y- 



ible. In addition, for the T = 2 case, I — A is invertible for all A, except when a 

Proof: See Appendix A. 

If (/ — A) is invertible, then it can be shown that the MTA objective given in (I5J) is 
convex and has the closed-form solution 

y* = (I-A)- 1 y, (8) 

where I is the T x T identity matrix. 

Should one find that I — A is not invertible (if e.g. some entries of A are negative), 
that problem can be solved by a slight perturbation of the matrix A, or by optimizing (pi) 
directly. 

4.3 Closed-form Solution for the Vector Case 

MTA can also be applied to vectors, with a closed-form solution. For this section only, let 
yu G M and yt £ M be vectors. The MTA objective in A5n generalizes to: 



T N t T T 

{Vt}I=i = argmin^^(y ti - y t ) T \y u - y t ) +7^^^ rs (y r -y s ) T {y r -y s ). (9) 



{yt}?Li t=i i=\ 



r=l s=l 



Using the fact that m(a — b) T {a — b) = —2(a — 6), we can obtain a closed-form solution for 
(19]). The definition of yt is the same as in (pi), except with vectors replacing scalars. The 
definition of A is unchanged from the scalar case. Let Y* £ ]R rfxT be a matrix with y\ as its 
ith column and let Y £ R be a matrix with yt as its ith column. Then (I9J) is solved by 

Y* =Y(I-A)- 1 , 

which generalizes the scalar solution ([8]) if column vectors are changed to row vectors and 
vice- versa. 
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5. Analysis of MTA 

In this section we present some analysis of MTA. Throughout, /x G K denotes the Txl 
vector of means for the T tasks, assumed to be finite, and capital F's will be used to denote 
random variables. We take as given Nt IID random variables {Yu} for the ith task with 
i = 1, . . . , Nt- We compare with the vector of sample averages Y 6 M?, where 



N t 

N t 



Yt -w^ Yti - 



i 

We do not assume any parametric form for the underlying sample distributions, but note 
that for many common parametric assumptions (such as Gaussian or Laplacian), the sample 
average is the maximum likelihood estimate. 

We first present in-depth analysis of the T = 2 case, starting with a special symmetric 
case, and then treating the general T = 2 case. We give conditions when the MTA estimates 
are better than the sample averages, provide a formula for the optimal task-similarity matrix 
A, and show the optimal A sometimes behaves unintuitively. We also prove for T = 2 that 
MTA has a smaller MSE than multi-task regularizing of the sample averages; we hypothesize 
that this is true in general. 

Then we turn to the general case for T tasks. A surprising result is that, under certain 
conditions, the MTA estimates are convex combinations of the T sample averages. Lastly, 
we note that MTA is asymptotically unbiased. 

5.1 Analysis for Symmetric T = 2 Case 

We first analyze the simplest case of T = 2 symmetric tasks, with N samples per task. 
Specifically, suppose Y\ is distributed with finite mean fii and finite variance a 2 , and Yi is 
distributed with finite mean \ii = /ii + A and finite variance a 2 . Let the task-relatedness 
matrix be A = [0 a; a 0]. For a fixed A, the regularization parameter 7 in (pi) is a useful 
degree of freedom to control the power of the regularizer. But for this analysis we treat A 
as a variable (that is, we treat a as a variable), and thus the 7 is an unnecessary degree 
of freedom, so without loss of generality we set 7 = 1/2. For this simple choice of A, the 
matrix inversion in the closed-form solution ([8]) can be solved analytically, producing the 
following properties. 

Regularized Estimate: The MTA estimate is a convex combination of the sample aver- 
ages (for a > 0): 

H^) Fi+ (^) F2 - (io » 

Bias: From ( 10 ) it follows that the MTA estimate is biased: 



■^-/"-Ij^lA. («) 
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Variance: From (|10|) it follows that the MTA estimate has a smaller variance than the 

'N 2 + 2aN + 2a 2 * 



sample average (for a > 0): 



Varfy^] 



<j- 



N \N 2 + AaN + 4a 2 



< Var[Yi]. 



(12) 



Mean-Squared Error: From (11) and (12), the mean-squared error for estimating only 



Hi (or only \x 2 -, because the two tasks are completely symmetric for this case) is 






N 2 + 2aN + 2a 2 



{N + 2af 



+ A' 



(N + 2a) 



MSE[Y{] 
Comparing to MSE[Yi] = a 2 /N, one can conclude from (13) that 



2' 



MSE[Yi*] < MSE[y] if A 2 < 2a 2 



1 1 

N + a 



(13) 



(14) 



Thus the MTA estimate has lower MSE if the mean-separation A is small compared to 
the variance weighted by a factor dependent on the similarity and the number of samples. 



Note that as a approaches from above, the RHS of ( 14 ) approaches infinity. Thus a small 



amount of regularization can be helpful even when the mean difference A is large. We 
illustrate this relationship in Fig. [T] Note further that the fact a > was not used in 
derivation of (14). 



Optimal Task Relatedness Information: The MSE given in ( 13 ) is a convex function 



of a, thus the first derivative can be used to specify the optimal value for a. This gives in 
the surprisingly simple result that the MSE[Yj*] is minimized by the task-similarity value 



a 
A 2 ' 



(15) 



This result is key because it specifies that the task-similarity a ideally should measure the 
variance of the task samples relative to the task mean-difference. In fact, a* is the inverse 
of the squared Mahalanobis distance between H\ and /j> 2 . Further, the ideal task-similarity 
is independent of the number of samples N. Intuitively, in the limit case that the difference 
in the task means A — >• 0, the optimal task-relatedness a* — > oo, and the weights in (10) on 



Y\ and Y 2 become 1/2 each. We illustrate the effect of choice of a and the optimal a* in 
Fig. § 

5.2 Analysis for General T = 2 Case 

Next we generalize the above analysis to the general case for two tasks. Consider the same 



set-up as Section 54, except let there be N\ samples of Y\ and iVjj samples of Y2, and let 
the corresponding variances be o\ and a\. For shorthand, let 

z = N ± N 2 + aNi + aN 2 . 

The MTA estimate is again a convex combination of the sample averages (for a > 0) : 

JVi(jy2 + a) - N 2 a- 
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Figure 1: Plot shows squared error summed for two tasks, averaged over 10,000 runs of the simulation. 
For each task there are N IID samples, for N = 2, 10, 20. One task generates samples from a 
standard Gaussian. The other task generates samples from a Gaussian with a 2 = 1 and varying 
mean value, as marked on the x-axis. The symmetric task-relatedness value was fixed at a — 1 
(note this is generally not the optimal value, see Fig. |2p . One sees that given only N = 2 
samples from each Gaussian, the MTA estimate is better if the Gaussians are closer than 2 units 
apart. Given N = 20 samples from each Gaussian, the MTA estimate is better if the Gaussians 
are closer than 1.5 units apart. In the extreme case that the two Gaussians have the same mean 
(pi — fi2 = 0), then with this suboptimal choice of a — 1, MTA provides a 45% win for N = 2 
samples, and a 15% win for N = 20 samples. 



and the MSE is 



MSE[ii*] 



a 2 Ni 






(16) 



Once again we compare to MSE[Yi] and get 



MSE[y x *] < MSE[yi] if A 2 < 



2a\ 



+ 



2af 

N 2 



+ 






11 
N 2 



We illustrate the MSE given in ([16} in Fig. [3} 

The optimal task-relatedness value a depends on whether one's goal is: (i) to minimize 
the MSE only on the first task regardless of error on the second task; or (ii) to minimize 
the summed MSE over both tasks. For Case (i), analyzing the first derivative of the MSE 



in ( 16 ) shows that the MSE for the first task is minimized by 



ot 



A 2 



N 2 



T 2 ' 
T 2 



(17) 



with swapped indices to compute a|. In this case the MSE is not convex, but an analysis 



of the second derivative shows that (17) holds in all cases except if the number of samples 
from the second task N% = (<r 2 — uD/A 2 . 



10 
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Figure 2: Plot shows the expected squared error summed for two tasks as given in (13 1, where the task 
samples were drawn IID from Gaussians A/"(0, 1) and 7V(1, 1). The task-relatedness value a 
was varied as shown on the x-axis. The minimum expected squared error is marked by a *, is 
independent of N and matches the optimal task-relatedness value given by (151. Given N — 1 



samples from each task, using the optimal task-relatedness a for this set-up rather than a = 
results in a 1/3 reduction in expected squared error; given N = 10 samples from each task, using 
the optimal a results in a 9% reduction in error. 
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Figure 3: Plot shows the average over 10,000 simulation runs of the sum of the two squared errors for two 
asymmetric tasks. For all the results shown, Ni — 5 samples are drawn from a standard normal, 
and A^2 = 10 samples are drawn from a second Gaussian whose mean was varied and plotted on 
the a>axis, and whose variance was either a\ — .2, 1, 5, corresponding to the three sets of lines 



shown. The similarity for each simulation was set to the optimal similarity as given by ( 18 1 



For Case (ii), analyzing the first derivative of the sum MSE shows that the sum MSE 
is minimized by 



nM + nM 



A 2 (N? + N%) 



11 l 



a, 



)(JVi-JNT 2 ) 



(18) 
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Analysis of the second derivative shows that this minimizer always hold for the cases of 
interest (that is, for Ni,N% > 1). 



A key consequence of (18) is that the optimal similarity a* is higher if there are more 
samples from the lower variance task, that is, if Ni > N2, then a smaller o\ leads to a 



higher a*. Fig. ^ shows results with the optimal similarity given by (18). In Fig. K^ 
task two has more samples, and when its variance is a\ = .2, the advantage of MTA is 
proportionally greater than when the task two variance is a\ = 5. The dominant effect 
in such cases is that MTA greatly decreases the error for the higher-variance less-sampled 
task, sometimes at the cost of a small increase in error of the more-sampled class. Even 



lower error can be achieved by using the optimal similarity for each task from (17) rather 



than using one similarity optimized for both tasks. We compare these two strategies in Fig. 
|4j which shows that a moderate gain is possible if one has the knowledge to optimally set 
the task-relatedness independently for two tasks. 

Surprisingly, while the MTA MSE varies smoothly as a function of any of its parame- 
ters, the optimal task-relatedness value a* can show large jumps and can be negative. To 



illustrate the unintuitiveness of (18), we plot the optimal similarity given in (18) for two 



slightly differing parameter sets in Fig. [5j 
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Figure 4: Plot shows the average over 10,000 simulation runs of the sum of the squared errors for two 



asymmetric tasks where the optimal similarity was set the tasks jointly using (18 1, or set for 
each task separately using (171. Other parameters were set to: JVi = 5 samples are drawn from 



a standard normal, and ^2 = 10 samples are drawn from a second Gaussian whose mean was 
varied and plotted on the a:-axis, and whose variance was a\ — .2. 



5.3 Estimating the Optimal Task Similarity 

Examining the formulas for the optimal task relatedness value a* in Section [5j one observes 
that the similarity matrix A should reflect the relatedness of the underlying tasks' statistics 
and not necessarily their semantic relatedness. This observation predates MTL by a few 



decades; Efron and Morris (1977) noted that shrinking baseball statistics towards imported 
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Figure 5: Plot shows the optimal similarity for two tasks as given by (18 1 as a function of the number 
of samples JVi drawn from a standard normal. The other parameters were fixed at N<z — 10, 
5, and two choices of mean separation are shown, A = .28 and .29. 



a\ 



car prevalence data can decrease estimation error, though few would call these two tasks 
semantically similar. 

In practice, however, we expect that the given task-similarity matrix A encodes domain 
knowledge reflecting semantic similarity and is not (necessarily) statistically optimal. We 
illustrate in our experiments that such task relatedness information is often available, and 
advantageous to incorporate into the estimation. If no task relatedness information is 
provided, it may be necessary to use the above formulas to estimate the optimal task 
similarity matrix. If, on the other hand, an expert-generated A is available, it may be of 
further benefit to combine it with an estimate of the theoretically optimal A. For example, 



based on ( 15 ), we hypothesize that it may be effective to estimate a as the average standard 



deviation of each task's samples, and estimate A as the difference of the sample means. 
This data-dependent approach is analogous to empirical Bayesian methods in which prior 



parameters are estimated from data (Casella, 1985). 



5.4 MTA Estimates Are Convex Combinations of Sample Averages 

From (fsl) and (pi) it follows that the MTA estimates are linear combinations of the task- 
specific sample averages {y~t}- However, as the special case of T = 2 above suggests, we 
can make an even stronger statement (with the addition of some constraints): the MTA 
estimates are convex combinations of the T sample averages. An important consequence 
of this theorem is that the MTA estimates preserve convexity constraints; that is, if the 
samples from all T tasks are from a convex set C, then the theorem guarantees that the 
MTA estimates will also be from the convex set C. 
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Theorem: Suppose all the entries in matrix A are non-negative. Then if Nt > 1 for all t, 

the MTA estimates are convex combinations of the sample averages. 

Proof: The proof is given in Appendix B. 

5.5 Asymptotic Analysis 

The (q, r) entry of matrix A given by (M) goes to as Nt goes to infinity. This means 
that the whole matrix A converges to in norm and since matrix inversion is a continuous 
operation, (J — A) -1 — y I in norm. Because Yt — Yt —>■ as Nt —> oo, by the law of large 
numbers we can conclude that Y* asymptotically approaches the true mean /i. 

5.6 Bayesian Interpretation of MTA 

In this section, we consider MTA in a Bayesian framework, and discuss related work in 
James-Stein and empirical Bayesian estimation. 

The MTA objective given in (J5J) is a structural risk minimization. Many structural risk 
minimization problems can be interpreted as maximum a posteriori (MAP) estimates. For 
example, consider the ridge regularized analogue to §5§: 

T N t T 

argmin^^(y ti - y t ) 2 + 7^#j? =argmaxp({y 4i }|y)p(y) (19) 

{&}tei t=l i=l r=l {Vt}t=i 

= argmaxp(y|{y ti }), 

where the likelihood p({yi}\y) is the normal distribution M(y,I) and the prior p(y) is the 
normal distribution A/"(0, 1/^f)- 

The Bayesian interpretation of (pi) has the same form as (19), where again the likelihood 



p({lJi}\y) 1S the normal distribution N~(y,I), but p(y) is now the improper prior 

nly) (X 6~ ' ) '^r = l2Js = l ArsOlr — Vs) 

T T 

= nii e ~ 7j4rs(2}r ~ $s)2 - ( 2 °) 

r=l s=l 



The improper prior (20) is a product of Gaussian priors on the differences. Note that in 



general this does not imply that the differences are independent. Consider for example the 



case of T = 3 tasks. Then, assuming the differences are independent, (20) becomes 



Yi - % ~ A/"(0, 1/27A12) (21) 

% - % ~ M(0, 1/2 7 A 23 ) (22) 

y x _ y 3 ~AA(0,l/2 7 A 13 ), (23) 

and note 

Y-Y 3 = (Yi -%) + (%-%). (24) 
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Here, since the differences are independent, the variances on the right-hand side of (24) 



must add, and (21)-(23) would imply that 



1 



1 



1 1 

-. — - -. h -. — and - — 

A13 A\2 ^23 A\2 



11 111 

-. h . — and - — = \- - — . 

A\z A23 ^.23 -^12 -4-13 



(25) 



The three constraints in (25) cannot be satisfied by any positive A, and for any T > 2 the 



analogous set of T{T — l)/2 constraints will also be impossible to satisfy. 

The proposed MTA estimator in (pi) differs from the related work in standard Bayesian 
and empirical Bayesian multi-task estimators. For example, a standard Bayesian multi- 
task estimator would tie the tasks together through a prior on the unknown means to be 
estimated, e.g., 



Y ti ~N(Y u a 2 ) 
Y t ~N(z,r 2 ), 



(26) 

(27) 



where the prior parameters z and r and the sample deviation a 2 would be assumed known. 
This Gaussian model results in the estimate 



JJt 



a 2 +t 2 a 2 + r 2 



(H, 



such that the tying between the tasks all occurs through the (assumed) known prior pa- 
rameters. 



Empirical Bayesian estimators use standard Bayesian models as in (26) and (27), but 



differ in that all the necessary parameters are estimated from the data rather than assumed 



(Casella. 1985). Analogously, we could estimate the matrix A from data; this is discussed 



further in Section [573 



The James-Stein estimator (Casella 1985) makes the assumption (26) with a known a 2 , 
then estimates 

(T-2)a 2 ~ 



y 



1 



Ml 



//. 



Thus, the James-Stein estimate is significantly different from the proposed estimator, and 
as with many empirical Bayesian multi-task estimates, does not jointly estimate the tasks 
for small T = 2. 

In summary, the proposed MTA estimator differs from the related work in Bayesian 
and Stein estimation in what information is assumed given (or to be estimated), and in the 
assumed statistical model. 



5.7 Comparison of MTA and Multi-Task Regularized Sample Averages 

Consider a variant of MTA in which one first obtains the sample mean yt for each task, and 
then applies ^ as follows: 

T T T 



{y t }t=l = argmin^(y t - y t ) 2 + 7^ ^A rs (y r 



Vs 



(28) 



r=l s=l 
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We refer to the resulting estimates {y}} as multi-task regularized sample averages. This 



variant is related to domain adaptation methods and transfer learning (Daume III and 



Marcu, 2006), which compute some estimates for some tasks, and then regularize estimates 



for new tasks to the previous tasks' estimates. Using similar analysis as in Section (5.1), it 
is straightforward to show that 



MSE[Yt ] = ^l±^±^ + A2 



(29) 



N(l + 2a) 2 (l + 2a) 2 ' 

To gain some insight into when MTA performs better than multi-task regularized sample 



averages, we examine the difference between (29) and (13). After some algebraic manipu- 
lations, this difference can be written as: 

r tl r „, a(N-l)(aA 2 NUa + N + l)-2a 2 (aN + a + N)) , N 

MSE[rt]-MSE[r-] = ^ N \ 1 + 3am ' +2a p -■ < 3 ») 



When the difference in (30) is positive, MTA performs better than the multi-task regu- 



larized sample averages. This occurs when 



A 2 



> 



l + ± + ^ 

1 I n * M 



N 



2a 2 N + Aa+jj' 

which is true for large values of N and/or a, and when the Mahalanobis distance between 
the two tasks is large. 



6. Multi-Task Kernel Density Estimation 

In this section we present multi-task kernel density estimation (MT-KDE), as an application 
of OSR. 



6.1 MT-KDE Objective 



Recall that for standard kernel density estimation (KDE) (Silverman, 1986), a set of random 
samples Xi £ Q,i £ {1, . . . , N} are assumed to be drawn IID from an unknown distribution 
px over some set 0, and the problem is to estimate the density over Q. Given a kernel 
function K : Q x O — > M, the standard KDE estimate for any z G SI is 



p(z) 



y( z ) 



v\ z ) 



where 



L'eny( z ') 

1 N 



(31) 



j=i 



Note that y{z) in (31) is the sample average of A^ kernel functions. When multiple 



density estimates are desired over the same O, we can replace the kernel function sample 
average in ( |31| ) with the MTA estimate, which we term MT-KDE. Suppose one is given T 
sets of samples {xu G ^}jJi for t G {1, ... ,T} drawn IID from each task, and T kernel 
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functions {Kt(xu, xtj)}, and a TxT task similarity matrix A. Then form the MTA estimates 
using ([5]): 

T N t T T 

{Vt( z )} = argmin ^ ^ (K t (x u , z) - y t {z)f + 7^ ^A-s (y r ( z ) ~ Vs{z)) 2 ■ 
{yt( z )}f=i t=\ i=l r=l s=l 

The normalized MT-KDE density estimate for the ith task is then Pt(z) = yl(z)/ Jq y^(z'). 
Note that often the normalizer is not needed in practice. If needed, it may be non-trivial 
to compute if £1 is not a finite discrete set. 

6.2 Density Estimation for Terrorism Risk Assessment 

We compare KDE and MT-KDE on a problem of estimating the probability of terrorist 
events in Jerusalem using the Naval Research Laboratory's Adversarial Modeling and Ex- 
ploitation Database (NRL AMX-DB). The NRL AMX-DB combined multiple open primary 
source:^] to create a rich representation of the geospatial features of urban Jerusalem and 
the surrounding region, and accurately geocoded locations of terrorist attacks. Density 
estimation models are used to analyze the behavior of such violent agents, and to allocate 



security and medical resources. In related work, Brown et al. (2004) also used a Gaussian 
kernel density estimate to assess risk from past terrorism events. 

We estimate a risk density for 40,000 geographical locations (samples) in a 20km x 
20km area of interest in Jerusalem. Each geographical location is represented by a D = 76- 
dimensional feature vector. Each of the 76 features is the distance in kilometers to the 
nearest instance of some geographic location of interest, such as the nearest market or bus 
stop. Locations of past events are known for three different density estimation problems: 
17 suicide bombings, 11 non-suicide bombings (e.g. car bombs), and 34 shootings. All 
the events are attributed to one of seven terrorist groups. For each of the three problems 
(suicide bombing, non-suicide bombings, and shootings), the density estimates for these 
seven groups are expected to be related, and we treat them as T = 7 tasks. 

We compared KDE and MT-KDE for this problem as follows. The set £1 consists of the 
40,000 grid points over which the density is estimated. The kernel function K was taken to 
be a Gaussian kernel on the two 76-dimensional feature vectors corresponding to any two 
grid points. Each Gaussian kernel was constrained to have diagonal covariance, where the 



variances were chosen automatically using the normal reference density technique ( Venables 



and Ripley, 2002). For the task-similarity matrix A needed for MTA, we asked terrorism 
expert Mohammed M. Hafez of the Naval Postgraduate School to assess the similarity 
between the seven groups during the Second Intifada (the time period of the dataset); 
these similarities are shown in Table [2} The similarity for the "Unknown" group in each 

1. Primary sources included the NRL Israel Suicide Terrorism Database (ISD) cross referenced with nu- 
merous open sources (including the Israel Ministry of Foreign Affairs, BBC, CPOST, Daily Telegraph, 
Associated Press, Ha'aretz Daily, Jerusalem Post, Israel National News), as well as the University of New 
Haven (UNH) Institute for the Study of Violent Groups (ISVG), the University of Maryland (UMD) 
Global Terrorism Database (GTD), and the National Counter Terrorism Center (NCTC) Worldwide 
Incident Tracking System (WITS). The terrorist event data used took place in Jerusalem during the 
Second Intifada, between October 2000 and July 2003. Only events that are believed to have occurred 
at their planned location were included, that is, bombs that were detonated early or on accident were 
not included. 
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Table 2: Hafez's Similarity Matrix A 
AAMB Hamas PIJ PFLP Fatah Forcel7 Unknown 



AAMB 





.2 


.2 


.6 


.8 


.8 


.6 


Hamas 


.2 





.8 


.2 


.2 


.2 


.4 


PIJ 


.2 


.8 





.2 


.2 


.2 


.4 


PFLP 


.6 


.2 


.2 





.6 


.6 


.5 


Fatah 


.8 


.2 


.2 


.6 





1 


.6 


Forcel7 


.8 


.2 


.2 


.6 


1 





.6 


Unknown 


.6 


.4 


.4 


.5 


.6 


.6 






column/row was set to be the average similarity of the known groups in that column/row. 
We were interested in investigating the usefulness of the provided expert similarity values, 
and thus, noting that the regularization parameter 7 acts a as multiplier on the similarities, 
we set 7 = 1. 

After computing the KDE and MT-KDE density estimates using all but one of the 
training examples {xu} for each task, we sort the resulting 40,000 estimated probabilities 
for each of the seven tasks, and extract the rank of the left-out known event. The median of 
the left-out rank is reported in Table [3j Ideally, the rank of the left-out known event would 
be very low, indicating that the location of the left-out event is at high-risk. The results 
show that the median ranks for MT-KDE are much lower than those for KDE for all three 
problems. 

Fig. [6] shows a spatial map of the density estimates for KDE and MT-KDE for the 
AAMB group, where for this figure all the past events were used as training. The green 
x's mark AAMB events. Two differences are notable. First, the MT-KDE predicts higher 
risk of AAMB events where the red x's are. Second, MT-KDE predicts less risk for many 
locations, analysis of the risk estimates for the non-AAMB groups suggests that the lowered 
risk corresponds to locations estimated to be low-risk for the other groups. 



Table 3: Median Rank of Left-Out Event Out of 40,000 



Method 



KDE 
MT-KDE 



Suicide 
Bombings 



308 

88 



Bombings 



6,512 
900 



Shootings 



11,160 
4,752 



6.3 MT-KDE For Density Estimation Over Different Domains 

We generalize MT-KDE to the problem of producing T density estimates where the ith 
density estimate is over a task-specific domain f^, with Ut domain elements that we index by 
Zf U for u = 1, . . . ,Ut- Over the T tasks there are a total of U = ^\ Ut domain elements. As 
the T domains share a feature space, we use the task-and-domain-element U xU similarity 
matrix that we implicitly index as A rv - sw to specify the similarity between the v th domain 
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KDE for AAMB 



MT-KDE for AAMB 



Figure 6: Spatial maps of the risk density estimated for suicide bombing events by AAMB, 
trained on all the past events. The green x's denote the six AAMB suicide 
bombings, the red x's denote the eleven suicide bombings by the other groups 
(specifically, by Hamas and PIJ). 



element of the rth task and the wth domain element of the sth task. Then we estimate 

T Ut N t 

argmin ^ ^J ^J (K t (x u , z tu ) - yt(z tu )) 2 
{{wt(«i«)}S=i}fc=i t=i u=i i=i 

T U r T U s 

+7^X^5Z5Z A rv;sw (yr(Zrv) ~ y S {z sw )) 2 . (32) 

r=l v=l s=l u)=l 

The resulting set of estimates {{y*(ztu)} u Li}I=i are then normalized within each task to 
form the T density estimates. That is, for the tth task, the MT-KDE density estimate is, 



Pt(ztu) 



yn z tu) 



-u, 



Z U V U y!(z tl 



6.4 Density Estimation for Mass Spectrometry Charge Classification 

We apply MT-KDE for density estimation over different domains to the problem of guessing 
the charge for mass spectrometry samples. The dataset consists of T = 25 runs of the LTQ- 
FT Ultra Hybrid Mass Spectrometer using data-dependent acquisition, and we treat each 
run as a task. Each of the runs is the result of a single protein sample of one of six species: 
C. elegans, S. cerevisiae, mouse, Leishmania, bovine, or human. Each sample is a tandem 
mass spectrum, from which was extracted a set of D = 18 features that were shown to be 



useful for a related mass spectrometry filtering problem ( Feldman , 2009 ) . Each spectrum is 



the result of extensive bombardment and fragmentation of a peptide ion. There are 24752 
spectra total, and within each task there are spectra of peptides of various charges from +1 
to +5, which we consider classes. To compare MT-KDE with KDE, we split the samples 
randomly in half into a training set (samples xu in ([32])) and a test set (samples z tu in 
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(132])). To perform randomized 10- fold cross-validation, the training set was further split 
75/25 into training and validation sets. For each of the five charge-classes, we use the MT- 
KDE objective to simultaneously estimate the probability density for all the test samples 
across the T = 25 tasks. Given the five estimated class-conditional densities for MT-KDE 
or KDE, each test sample is classified as the maximum likelihood charge-class. 

Further experimental details are as follows. Each task's data is separately standard 
normalized to have zero mean and a standard deviation of one. The similarity function in 
this case is both task and sample dependent and can be factorized as A rV]SW = A rs A^ v . sw , 
where A rs = 1 if the species of run in task r and s are the same, and otherwise, and A^ v . sw 
measures the similarity between the feature vectors z rv and z sw with a radial basis function 
with fixed bandwidth a 2 = 10. This a value was chosen such that the mean similarity of the 
resulting training similarity matrix is approximately 0.5, so that the resulting similarities 
are well scaled and not skewed too far towards zero or one. We consider the following range 
of 7: {0, 0.001, 0.01, 0.1, 1, 10} for the purposes of cross-validation; 7 = 1 was chosen as 
the optimal parameter. Note that 7 = is the standard KDE case. 

The resulting test classification errors are 41.1% for KDE and 33.3% for MT-KDE. The 
difference in these errors is statistically significant according to two one-sided Wilcoxon 
non-parametric tests; the tests were performed on the resulting zero-one vector of misclas- 
sifications (zero when the maximum likelihood class is correct, and one otherwise). 

7. Multi-Task Regularized Similarity Discriminant Analysis 

Similarity discriminant analysis (SDA) is a generative classifier that models the class- 
conditional probability distribution of the similarities between samples using discrete ex- 



ponentials (Cazzanti et al. , 2008). Like quadratic discriminant analysis, SDA may produce 
biased class models, and it has been shown that applying SDA locally (called local SDA) 
to a set of k neighboring (most similar) training samples for each test sample generally 



decreases model bias and improves performance (Cazzanti and Gupta, 2007). However, 
estimating the required discrete exponential parameters from only k neighbors can have 
problematically high variance, and in some cases the maximum likelihood parameter esti- 
mate is infinite. Here we show that multi-task averaging can be profitably applied to this 
parameter estimation task. 

7.1 MT-SDA Formulation 

In similarity-based classification, the training data is a N x N matrix S of pairwise similar- 
ities between N training samples, and their corresponding class labels yi G {1,2,..., G} for 
i = 1, . . . , N. The similarities are assumed to have a discrete (perhaps after quantization) 
domain $7. The test data is a N x 1 vector s of similarities between a test sample and the N 
training samples. SDA produces a probability that the test sample belongs to each of the G 
classes. In our experiments we use local SDA, but for notational simplicity we restrict our 
explanation to the SDA; the only difference is in practice the SDA model is trained anew 
for the k nearest-neighbors of each test sample. 



The pairwise SDA model (Sadowski et al. 2010) requires estimating G 2 discrete ex 



ponential distributions of the probability of seeing a certain similarity value between two 
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samples if one sample is from class g and the other sample is from class h. That is, the 
gth-hth distribution has the form: 

P{s gh ) = l gh e X ^\ (33) 

where s g h is a similarity between a sample from class g and a sample from class h. 



We focus on the problem of estimating the parameter A g ^ in ( 33 ) ; given \ g h the nor- 
malizer j g h will be implied. The maximum likelihood estimate X g h satisfies the empirical 
mean constraint ( Sadowski et al. , 2010[ ) 



v 9h= Yl s ahlghe X9hSsh , (34) 

s ah en 



J2a:y„= q "Eb:vu = h S ( a > 6 ) 



where v g h = — °" Va 9 k ^ b is the empirical mean with k g , kh are the number of training 



samples in class g and class h, respectively. To find the maximum likelihood estimate A^, 



one can compute the sample average on the left-hand side of (34), and then numerically 
solve for the best \ g h,7gh pair. A particular problem arises when S(a,b) is at an extreme 
value of the similarity domain tt for all a, b. For example, if the similarities can only be 
or 1, and all training sample pairs from class g and class h happen to have zero similarity, 
then the X g h will be oo. 

We apply MTA to this problem and compute multi-task regularized sample averages to 



replace the left-hand side of (34), and then numerically solve (34) to produce the multi-task 
SDA (MT-SDA) estimates of the discrete exponential parameters. Here, the multiple tasks 
are the G 2 class-pairings. 

In this application, the multi-task regularization operates across pairs of classes: the 
average similarity of samples from class g to samples of class h is regularized toward the 
average similarity of the samples from class I to class m. In our experiments, we define 
the task-similarity matrix iasa Gaussian kernel on the pairwise class sample averages, as 



follows. Let v g h denote the gth.- hth sample average given by the left-hand side of (34). The 
task similarity between the gth-hth task and the l-mth. task is taken to be: 

■rt-g—h-jl—m — c 

7.2 MT-SDA Experiments 

We compare the classification performance of the local pairwise SDA classifier to the above 



MT-SDA variant. Other similarity-based classifiers have been proposed Chen et al. (2009), 
but we restrict our comparison to local SDA to focus on the value of adding the multi-task 
regularization. 



We report classification results for six different benchmark similarity datasets (Chen 



et al. , 2009 ) . Table H^ shows the mean test error rates computed from 20 random train/test 



splits of the data, with 20% of the data held out for testing and 80% used for training. For 
each split, the training set was further split into 10 disjoint cross-validation partitions to 
select the multi-task parameter 7 and the Gaussian kernel parameter a among the possi- 
ble values {10~ 3 , 10~ 2 , 0.1, 1, 10}, and the neighborhood size k among the possible values 
{2,4, 8, 16, 32, 64, max(128, iV)}). Across five datasets multi-task local SDA outperforms 
the standard local SDA and one dataset is a tie. The improved performance is statistically 
significant according to a Wilcoxon sign rank test (p = 0.05). 
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LI. 32 


15.25 


11.56 


10.00 


6.15 


4.23 


8.95 


14.50 


11.56 


9.77 


5.52 


3.44 



Table 4: Percent test error averaged over 20 random test/train splits for the benchmark 
similarity datasets. 

Amazon Sonar Patrol Protein Voting FaceRec 
2 classes 2 classes 8 classes 4 classes 2 classes 139 classes 
Local SDA 
Multi-task Local SDA 



8. Multi-Task Local Linear Regression 

In this section, we discuss how OSR can be used for multi-task regularization of some 
parametric models. We focus on the local linear regression model, where f(x, ft) = x ft + 
fto for some local neighborhood of x G R , with parameters ft £ M. d and fto £ R. For local 
linear regression, we combine the OSR regularizer with a standard Tikhonov regularization 
of each ft S M. d to find a matrix of regression coefficients (3 with tth column ft that solve 

T Nt 



argmm^2^2(y ti - ft T x 4i - (3 to f 



IJ 4=1 i=l 

T T 

+ 1 Y, A rs(ffzr + fto " f a Zs - fto) 2 + A ^(ft - ft;) T (ft - ft;), (35) 

r,s=l <=1 

where z r is the rth task or test sample, [3g 6 K is a fixed vector of linear regression weights 
that the local regression coefficients are regularized towards, and 7 and A are regularization 
parameters. We set fto = mean({yi}), which is the optimal offset for Tikhonov regulariza- 
tion. For our experiments, we set ft; £ R rf to be the global least-squares linear regression 
coefficients found by fitting all the training data with one hyperplane (rather than locally 
fitting some of the training data) . Tikhonov regularization often increases accuracy, but it 
does not explicitly enforce smoothness of the outputs across the local neighborhoods. The 
OSR regularization may add further error reduction, and should create a smoother function 
over the domain because it explicitly forces similar (e.g. "nearby") test samples to have 
close output values. 



The OSR-LLR estimate defined by (35) has a closed-form, given in Appendix C. 



8.1 OSR-LLR Color Management Experiment 

We demonstrate the OSR-LLR on the problem of color management of a printer, which 
consists of estimating what RGB color should be input to a printer to best reproduce a 



desired CIELab color (Bala, 2003). Because such estimations must be fast, color manage- 
ment is internationally standardized by the International Color Consortium to use a look-up 
table (LUT) or lattice of points in CIELab space that can be efficiently interpolated for any 
particular test CIELab color. The color management estimation problem is to best estimate 
the RGB values corresponding to the CIELab LUT. Local Tikhonov regression, as defined 



by the first and third terms of (35), has been shown to be a state-of-the-art method for 



this estimation problem (Hrustemovic and Gupta, 2008), where the ft; are the regression 
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coefficients for the least-squares global hyperplane fit. For those experiments and these, 



"local" is defined by the enclosing k-NN (Gupta et al. 2008) such that Xt is the matrix of 



the Nt enclosing k-NN training points for test point Zt- 

In this work, we treat the estimation of the RGB value for each gridpoint in the lattice 
as a task, and simultaneously regress many of them at a time. For problems such as 
color management, minimizing the MT trace- norm, for example, is not expected to be 
helpful since the problem is already low-dimensional and the colorspace dimensions are all 
meaningful. Most importantly however, focusing on tying together the model parameters 
does not explicitly enforce smoothness between the lattice points, which the "manifold 
regularization" term in OSR-LLR does explicitly. For this application, smooth rendering 
of color gradients in the reproduced images is an important dual goal alongside estimation 



accuracy (Morovic et al. 2008). 

In our experiments, the training samples were (CIElab,RGB) pairs measured from the 
918-sample Gretag Macbeth TC9.18 RGB target for a laser printer (Samsung CLP300) and 
an inkjet printer (HP D7260). In each case, the LUT to be estimated is a regular 17 x 17 x 17 
lattice in CIElab space. For each of the 17 3 lattice points, the problem is to estimate an 
RGB triplet of outputs. As standard, we treat the R, G, and B problems independently, 
and we use a standard ICC profile architecture of ID LUTs for linearization before the 



estimated 3D LUT (Bala, 2003) 



For OSR-LLR, we regularized all the in-gamut lattice points towards each other, where 
the in-gamut points were estimated to be all lattice points within three AEjq (that is, 
Euclidean distance in CIELab space) of the convex hull of the training samples. For the 
Samsung and HP printers there were T = 611 and T = 756 estimated-in-gamut lattice 



points. For lattice points that fell outside the estimated gamut, we set 7 = in (35) which 
is equivalent to only using Tikhonov regression for those points. For the 3D LUTs, to ensure 
that distant samples do not overly influence one another, we used a simple adjacency test 
such that A rs = 1 whenever two lattice points are directly next to each other (each lattice 
point has 6 such neighbors), and otherwise, resulting in a very sparse task-similarity 
matrix A. The spacing of the ID LUTs is much finer, so we use the radial basis function 

I, _ r 1 12 
\\ z r z s 1 12 

kernel A rs = e a . 

For Tikhonov regression, we cross- validated the regularization parameter A from a range 
of {0, 1, 5, 10, 20} and chose it by minimizing the difference between the training points' RGB 
values and their estimated values for each choice of A. For the Samsung and HP printers, 
A = 10 and A = 5 were chosen for Tikhonov for each printer. For OSR, we used the A 
chosen for Tikhonov to ensure a direct comparison. To choose 7, we ran the estimated 
LUTs on a benchmark test image, and chose the largest 7 from the set {0, 0.1, 1, 10, 100} 
that preserved the black point (too large a 7 will turn black to gray because the output 
colors are smoothed towards each other). For the Samsung and HP printers, these values 
were 7 = 10 and 7 = 1, respectively. 

For all estimation methods, each dimension of Xt was normalized to have mean and 
standard deviation 1, and the associated zt was normalized accordingly. For OSR, the tth 
column block of Bq was normalized by the standard deviation of Xt. 

Because perceptions of smoothness are strongly dictated by uneven jumps in luminance 



( Wyszecki and Stiles 2000 ) , the smoothness is measured by the square root of the average 
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£2 norm of the estimated LUT's RGB differences along the L* dimension for adjacent 



nodes, compared to the smoothness baseline of even RGB differences (Morovic et al. 2008). 



These roughness results are reported in Table [5] The results show that the OSR+Tikhonov 
LUTs provided greater than just-noticeable decreased roughness (increased smoothness) as 
expected, and the statistically indistinguishable or better accuracy than only using Tikhonov 
regularization. Specifically, OSR is statistically significantly more accurate than Tikhonov 
for the HP printer according to a Wilcoxon one-sided significance test (with a p-value of 
.05), and there is no statistically significant difference between OSR and Tikhonov for the 
Samsung printer. 



Table 5: Mean roughness and A.E94 error for HP D7260 and Samsung CLP300 printers. 



HP D7260 

Roughness Mean Error 



Samsung CLP300 
Roughness Mean Error 



Tikhonov Regularization 

OSR + Tikhonov Regularization 



18 
16.2 



2.12 
2.09 



18.5 
16.3 



4.02 
4.05 



9. Conclusions and Open Questions 

The presented OSR formulation is powerful and can be applied to a broad class of esti- 
mation problems. We illustrated these attributes with three applications of OSR. First we 
presented a multi-task kernel density estimation. Then, the proposed multi-task similar- 
ity discriminant analysis showed the use of OSR for a complicated parameter estimation. 
Lastly, we showed how parametric models like local linear regression can benefit from the 
smoothing action of OSR. 

Because it is the outputs that are regularized together rather than the parameters, OSR 
can be used with different models for each of the different tasks. For example, consider the 
two-task problem of detecting anomalies from an audio stream and a synchronized video 
stream. The input feature spaces for the two tasks are different. It would be nonsensical to 
impose any constraint tying the task-specific function parameters together numerically or 
insist on a shared low-dimensional feature space. Thus, this two-task problem is difficult to 
even formulate in the MTL paradigm that dominates today's literature, but naturally fits the 
OSR paradigm. Despite the input spaces not being identical, one can nevertheless specify 
the task-sample similarity matrix A by preferentially tying together cross-task samples that 
are closer in time. 

A key question that we have only partially answered is, "How should one specify a 
task-sample similarity in general? Our T = 2 MTA analysis showed that the task simi- 
larity matrix should capture the relative in-task variance to between-task difference. Our 
experiments successfully relied on side information and kernel computations to quantify 
task-and-sample relatedness. The anomaly detection example above uses the known vari- 
able of time as side information for sample relatedness. Another approach is to let the 
task-sample similarity matrix be an additional variable to optimize over. A third approach 
is to estimate optimal task-sample similarities. In the context of MTA, we touched on the 
question of how to estimate the optimal task similarity matrix from data, analogous to an 
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empirical Bayesian method, but much theoretical and experimental research remains in that 
direction. 

We found that a negative two-task similarity may be optimal in terms of minimizing 
mean-squared error. The interpretation of a negative similarity remains an open question. 
We did not explore the potential of a negative similarity in any of our applications. Further, 
some of the other theoretical results do not hold for negative similarity, such as the convexity 
result. 

We focused on squared error for the loss and regularization functions, because of its 
usefulness and tractability. Alternative loss functions remain to be explored. Multi-task 
medians, for instance, resulting from an L\ loss may be very useful in practice. Multi-task 
learning of vectors and matrices may also benefit from the proposed regularization. We 
established some theoretical foundations for MTA, perhaps the simplest OSR formulation. 
Our main theoretical result is that the proposed MTA estimates are convex combinations of 
sample averages. Our proof directly analyzes the Laplace expansion of the matrix inverse. 
An open question is whether there is a simpler proof of this result. 
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Appendix A: Proof of Lemma 
Let B = I - A. The (t, s)th entry of B is 



Bts ~ \ l{A ts +A st ) 



1 if t = s 



N t +irt 

where vr t = 7 ^(A ts + A st ). 



if t^ S, 



Next, we use the Gershgorin Circle Theorem (Horn and Johnson 1990). Let the sum of the 
off-diagonal entries for the tth row be 

R t = Y\B ts \= A / f ■ (36) 

The Gershgorin disk D(Bu, Rt) is the closed disk in C with center Bu and radius Rt- From 



(36), Rt < 1 if all the entries of A are non-negative. Then since Bu = 1 for all t, every 
Gershgorin disk is contained within the positive half-plane of C. Thus by the Gershgorin 
Circle Theorem, the real part of every eigenvalue of matrix B is positive, and therefore the 
determinant of B is positive. This completes the proof. 
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For the T = 2 case, we have that A\ 2 = A 2 \ = jv+2a-y • ^he ma tri x I — Ais not invertible 
if and only if its determinant is zero: 

Det(/-i) = l-f ^ )( ^ ) = 



N 1 N 2 + 2(JVi + N 2 )a>y + 4a 2 7 2 = 4aV 

-N t N 2 



2(N 1 + N 2 ) 7 ' 
as was to be shown. 

Appendix B: Proof of Theorem 

By the theorem's assumptions, the entries of A are non-negative, and thus by the lemma 



in Section 4.2, (/ — A) -1 exists. Recall that a matrix inverse can be written as the product 
of the determinant and the transpose of the matrix of the cofactors. Because there is no 
set ordering of the tasks, we can simultaneously swap the rth and sth row and column of 
I — A; this means that, without loss of generality, it is enough to prove the theorem for the 
first task estimate: 

T 

1 N 

((/ - Ar'yh = — - £ C rl —^—y r , (37) 

det(J - A) ~ N r + TT r 

where C rs is the (r, s)th cofactor, and by the theorem's assumptions N r > 1 for any r = 



1, ... ,T. To show that (37) is a convex combination of the T sample averages, we must 
show that the coefficients on the y r are non- negative and sum to 1. 
That is, we must show that 

det(J-A)^ rl N r + TT r 



or equivalently, that 



T 

r=l 



For ease of notation, let S s t = 7(A s t + At s ) and Su = 0. Since 



r=2 



the condition (38) can be re-stated 



TTlCll _ V^ (S'lr + N r )C T \ _ 

iVi+TTl ^ N r + Trr 
r=2 



2(i 
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Factor out l/(Ni + 7Ti) from (39) to produce the equivalent condition 

T 



TTlCil " ^ 



(Si. + iV^^ + TrOai 



r=2 



AV 



0. 



7T r 



(40) 



As a first step to showing (40), let's look at C\\. It is the determinant of the (T — 1) x (T — 1) 
minor 



M- 



n 



1 

S23 

N a +ir s 

■5*2 T 
Nt-\-t^t 



S23 

N 2 +TT 2 
1 






S24, 
N2+-K2 

■5*34 
N3+T3 



^2T 
^2+712 

Ar 3 +7r 3 



J (T-1)T 

Nt+tct 



and therefore it is 



C 



11 



n 

s=2 



1 



A r ,+7T S 



-/V2 + VT2 — 5*23 

— S23 ^3 + 7T 3 



-5*24 
S34 



-S2T 



-S3T 



_5 2T 
_5 3T 



(41) 



We will call the matrix whose determinant featured in Eq. (41 ) by D and the corresponding 
determinant by C' u . That is, 



D 



N2 + 7T2 —S23 —S24 " " " —S2T 

— S23 N 3 + 7T 3 — S34 • • • — S3T 

—S2T —S3T • • • — 5'(t-i)t -^r + ^T 



and C' n = det(D). 

Next let's look at the cofactors C r \. They are the product of (— l) r+1 and the determi- 
nant of the (T — 1) x (T — 1) minor M T \ which we get by removing the rth row and the 1st 
column of matrix I — A. If we look at the determinant of M r \ we see that we can factor 
out 



1 



N k +TT k 

JVl + TTi 
N r + 7T r 



from the rows k < r and « , from the rows k > r. As a result we get that 



n 



1 



iV s +vr s 



N2 + 7T2 — S23 — S; 



21 



-512 



•Sis 



-52T 



Sit 



-S. 



2(T-1) 

-S2T 



•S(t-2)(T-1) Nt-1 + 7TT_l — Sr(T-l) 

~S(j'—2)T ~S(T—1)T Nt + 7T2 1 



(42) 



where the row [S12 S13 • ■ • Sit] is i n the (r — l)th row of the determinant. We will call the 



determinant in Eq. (42) C' rl . Notice, that we can construct C' rl from C'n by replacing the 
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(r — l)th row of C' xl with the vector [Si 2 S13 • • • Sir]- Using the expressions for C' u and 



C' rl , we can derive a simpler constraint that is equivalent to Eq. ( 40 ) : 



7TlC[ 



11 



T 



(Si, + N,.)CU = 0, 



or equivalently 



Y, Si s C' n - J^(S lr + N r )C' rl = 0. 



(43) 



r=2 



For each s value we will express the matrix that produces the determinant C' X1 slightly 
differently. For any s let C[i be written as the determinant of a matrix where we replace 
the (s — l)th row by the sum of all the rows (which does not change the determinant), that 
is, let C'u be written 



N 2 + TT 2 



-S23 



-S 



21 



N 2 + S 12 iV 3 + Sl 3 N3 + S13 



-S2T 



-S3T 



-s 



(T-l)T 



— S2T 



N T + Sir 



iV2 + vr 2 



As the last step we show below that for each r and each s in {2, . . . , T} the (s — l)th term 
of (Si r +N r )C' rl cancels the (r— l)th term of SisC^. This gives a one-to-one correspondence 



between the terms of the sums in Eq. (43) and as a result the difference in Eq. (43) is 0. 
To show that, let r,s £ {2, . . . , T} and consider the Laplace expansion of C' rl along the 

(r— l)th row. The s— lth term in that expansion is ( — 1) 

where Mi r _ 1 w 1-1 is the minor of D we get by erasing the (r 



(Si r +iV r )Si s det (M^)^) 
l)th row and the (s — l)th 
column of D. Consider also the Laplace expansion of SisC'n along the (s — l)th row and 
consider the (r — l)th term of this expansion: (— l) r+s_2 Si s (A^ r + Si r )det I Mi 



Mi 



'(«-l)(r-l) 



and since the determinant of a matrix 



Since D is symmetric Mj s _ 1)(r ._ 1} - ^« (r ._ 1)(a _ 1} 

and the determinant of its transpose are the same we have that 

(-l) r+s - 2 (Si r + N r )S ls det (M^)^) = (-iy+ s - 2 Su(N r + S lr ) det (m[ 



s-l)(r-l) 



At this point, we have shown that the coefficients involved in Eq. (37) sum to 1. Until 
now, we have only needed the fact that S s t = Sts an d that I — A is invertible. The 
theorem's assumption that A has non-negative entries is sufficient but not necessary for the 
invertibility of / — A. 



In order to complete the proof, we must also show that the coefficients in (37) are non- 



negative, and for that we explicitly need the theorem's assumption that the entries of A are 
non-negative. 



From the proof of the lemma, we already know that the determinant in (37) is positive, 



and by the theorem's assumptions N r /(N r + 7i>) > 0. Thus, we only need to show that 
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all the C r \ are non-negative for r £ {2, . . . , T}. If we look at the Eq. (42) we see that it 
is enough to show that C' rl is non-negative in order to show that G r \ is non-negative. To 
show that C' rl is non-negative we rewrite it as follows. Replace the (r — l)th column with 
the sum of all the columns: 



N 2 + 7T2 



S 



12 



-Szr 



-S23 



5*13 



-S 2 (T-1) S 2 (T-1) 



S 2 T 



N 2 + Si 2 



7<1 
Nt-1 + Si(T-l) 

Nt + Sit 



-S 2 t 



Sit 



-St(t-i) 
Nt + vr t 



(44) 



This replacement does not change the determinant. Let's call the matrix featured in Eq. 



(44) by F. If we apply the Gershgorin Circle Theorem to matrix F, and again denote 



the sum of the absolute values of the off-diagonal elements in row t by Rt, we see that 
F tt — Rt = Str > for all t / r — 1 and i^r— i)(r— 1) ~~ Rr-l = Sir > 0- Thus the distance 
from the center of each of the Gershgorin discs to the origin is at least as large as the radius, 
and thus by the Gershgorin Circle Theorem, the real part of each eigenvalue of F must be 
non-negative, which implies that the determinant of F (which is C' rl ) is non-negative. 



Appendix C: OSR-LLR Solution 



We give a closed- form solution to (35). Additional notation is needed. We define the 
following: 



Xt£ 



x e 



z e 



Ye 
Be 



vDxN t 



flxNt 



the matrix with columns xu, for i — {1, . . . , Nt} 
the row vector with entries y ti 



v a block diagonal matrix with blocks X t , for t = {1, . . . , T}, with N — 2t=i ^ 

a block diagonal matrix with single column blocks Zt 
a block diagonal matrix with single row blocks yt, for t — {1, . . . ,T} 
a block diagonal matrix with single column blocks fit 



L — A d — A the graph Laplacian matrix (Chung 20041 of A, where A d is a diagonal matrix with entries 

A rr = / , t — 1 Art 

-Bo G E TxJV a block diagonal matrix with single row blocks containing /3 t o repeated N t times 



0o 6 



a diagonal matrix, with (t, t)th entry /3to 



Bq e R dTxT a block diagonal matrix with single column blocks 0a 
1 a column vector of ones. 



The OSR-LLR objective given by (35) can be re- written in this matrix notation: 



arg min V (Y - (B 1 X + B ))(Y - (B 1 X + B )y 1 + 7 1 J (B 1 Z + /3 )L( J B J Z + fa) 1 1 

B 

+ X1 T (B-B G ) T (B-B G )1. 
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The OSR-LLR problem is convex, and thus solved by taking the partial derivative of the 
objective with respect to B and setting it equal to zero. To do so, we use the following two 



identities (Petersen and Pedersen, 2008): 



da T B T DBa tr!r „, D T , 8a T B T b , T 
_ = (D T + D)Baa T and QR = ba T . 

Thus the minimizer matrix B* must solve: 

= -2XY T 11 T + 2XX T B*11 T + 2XB%11 T + 2~fZLZ T B*ll T + 2-fZLf3jll T 
+ 2\B*11 T -2XB G 11 T . 

Because of the special structure of the terms, the above can be simplified without destroy- 
ing information by right-multiplying both sides of the equation by ^1, getting rid of the 
rightmost 1 T . Note that the remaining B*l is actually a dT x 1 column vector of all the 
Pt, stacked. Let B* = B*l, then, 

B* = (XX T + ~/ZLZ T + \I)- l (XY T - XBl - -yZLffi + XB G )1, (45) 

where the inverse exists as long as A > because XI is positive definite, while XX + r yZLZ 
is PSD for symmetric A. 
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